ABSTRACT

In  this  study  we  present  approaches  to  multilin- 
gual speech  recognition.  We  first  define  different  ap- 
proaches, namely  portation,  cross-lingual  and  simul- 
taneous multilingual  speech  recognition  and  present 
results  in  these  approaches.  In  recent  years  we  have 
ported  our  recognizer  to  other  languages  than  Ger- 
man. Some  experiments  presented  here  show  the  per- 
formance of  cross-lingual  speech  recognition  of  an  un- 
trained language  with  a recognizer  trained  with  other 
languages.  Our  results  show  that  some  languages  like 
Italian  are  per  se  easier  to  recognize  with  any  of  the 
recognizers  than  other  languages.  The  substitution 
of  phones  for  cross-lingual  recognition  is  an  impor- 
tant point  and  we  compared  results  in  cross-lingual 
recognition  for  different  baseline  systems  and  found 
that  the  number  of  shared  acoustic  units  is  very  im- 
portant for  the  performance. 

1.  INTRODUCTION 

Over  the  years  we  have  studied  speech  recognition 
and  speech  understanding  systems  in  German,  and  as 
more  and  more  multilingual  applications  are  needed, 
the  ISADORA  system  was  also  used  for  multilingual 
speech  recognition  [1,  8]. 

The need for multilingual speech recognition
applications has risen, for example, with growing
internationalization, as within the European Community
or in telecommunications. Thus, applications are
developed for recognition in a new language: for
example, dictation systems are ported to a new
language, or information systems are developed, e. g.
for tourist information at airports and train stations,
which have to be able to understand several languages.
When developing a recognition system for a new
language, either exclusively for the new language or
for the new language in addition to existing languages,
the recognition system optimized for the first language
has to be adapted to the characteristics of the new
language.

During  this  process,  mainly  data  like  the  vocabulary, 
acoustic  parameters,  language  models,  and  the  dialog 
structure  have  to  be  adapted.  Most  of  these  adapta- 
tions have  already  been  performed  before,  e.  g.  when 
porting a system to a new domain. One topic is still
specific to porting to a new language: the definition
and the use of acoustic units. If the recognizer is
completely rebuilt for a new language with training
material of that language, the definition of new
acoustic units arises from the pronunciation of the
words in the vocabulary. But when there is not
sufficient training material available for the new
language, or when two languages are recognized at the
same time, the acoustic units of the old and the new
language have to be related to each other. This problem
and solutions to it are the central aspect of this
contribution.

In  the  following,  we  will  cluster  approaches  of  mul- 
tilingual speech  recognition  in  order  to  provide  clear 
definitions  for  the  different  approaches  and  describe 
characteristics  of  these  approaches.  Then  we  will 
shortly  describe  the  available  data  material  for  our 
experiments  and  present  different  strategies  of  phone 
substitution  during  the  transition  of  languages.  We 
will  present  experiments  and  results  for  different  ap- 
proaches of  multilingual  speech  recognition  and  phone 
substitution  techniques. 

2.  DEFINITIONS 

When  looking  at  the  approaches  made  in  multilin- 
gual speech  recognition,  we  find  that  they  may  be 
clustered  into  three  groups  depending  on  the  ap- 
plication goal  and  available  data,  namely  porting, 
cross-lingual  recognition  and  simultaneous  multilin- 
gual speech  recognition. 

When  a speech  recognition  system  developed  for  one 
language  is  used  for  recognition  in  another  language, 
we  speak  of  porting.  This  step  is  similar  to  that  of  de- 
veloping an  application  in  a new  domain  of  the  same 
language.  The  vocabulary  and  the  acoustic  units  have 
to  be  defined  for  the  new  language.  Special  attention 
must  be  paid  to  characteristics  of  languages  like  ho- 
mophones or  compound  words  and  other  characteris- 
tics affecting  the  recognition  process.  For  these  char- 
acteristics, algorithms  have  to  be  found  that  can  cope 
with  these  new  problems.  The  system  is  then  trained 
with  data  of  the  new  language.  This  approach  can  be 
found  for  example  in  [2,  3,  11]. 

Another  approach  follows  the  same  application  goal 
as  the  approach  above  with  the  only  difference,  that 
there  is  not  sufficient  training  material  available  in 
the  new  language.  Thus,  for  cross-lingual  recogni- 
tion methods  must  be  found  to  use  training  material 
of  another  language  for  a rough  modeling  of  acoustic 
parameters  and  only  to  perform  an  adaptation  with 
few  data  of  the  goal  language.  One  main  problem  is  to 
determine  identical  acoustic  units  or  to  model  exist- 
ing acoustic  units  in  a way  that  with  few  adaptation 
data  a good  recognition  can  be  provided.  Approaches 
of  this  kind  can  be  found  for  example  in  [4,  7]. 

The third cluster of approaches is that of simultaneous
multilingual recognition. Applications of this approach
accept utterances of different languages at the same
time in the same recognition system. There are two main
strategies for this approach: either to perform some
kind of language identification followed by monolingual
recognition, or to have only one recognizer that
distinguishes in some way between the languages. For
the latter strategy, identical acoustic units may be
used across the languages, or completely different
acoustic units, or a combination of mono- and
multilingual acoustic units. Likewise, language
modeling may be multi- or monolingual, which also
determines whether transitions between languages are
allowed. Approaches to simultaneous speech recognition
can be found for example in [1, 8, 10].

3.  DATA  BASES 

The  data  used  in  our  experiments  result  from  three 
projects:  the  EU  project  SQEL  (Spoken  Queries 
in  European  languages),  the  EU  project  SpeeData 
(Speech  Recognition  for  Data-Entry),  and  from  the 
BMBF  project  VerbMobil. 

The  SQEL  project  covers  the  languages  Slovak, 
Slovenian  and  Czech  in  an  information  system  for 
train  and  flight  time  tables.  The  SpeeData  project 
covers  the  languages  Italian  and  German,  both  spo- 
ken by  dialect  and  non-natives  speakers.  The  task  of 
the  project  is  the  entry  of  land  register  data  in  the 
bilingual  region  of  South  Tyrol  in  the  original  lan- 
guage, thus  the  rate  of  non-native  speech  will  always 
be  around  50  percent.  The  VerbMobil  project  deals 
with  date  scheduling  among  humans  in  Japanese,  En- 
glish and  German  including  automatic  translation 
among  the  languages. 

An overview of the training data used from these
projects is given in Table 1. With these data, we
cover seven languages (German (G1, G2), Italian (It),
Slovak (Sa), Slovenian (Se), Czech (Cz), Japanese (Jp),
and English (En)), with German covered twice. The
German data labeled G1 come from the SpeeData project
and contain dialect and non-native speakers, whereas
the data set G2 from the VerbMobil project covers only
native German speech.


Language             G1    It    Sa    Se    Cz    Jp    En    G2
Data/hours           8.6   7.6   5.1   6.1   7.2   27.4  9.6   28.5
Distinct vocabulary  5455  6748  1061  955   1323  3207  2157  7444

Table 1. Acoustic data for each language

The data consist of spontaneous speech for most of
the languages; only for G1 and Italian was read speech
recorded. Due to the high proportion of non-native
and dialect speakers, who often try to speak the
standard language, there are a number of hesitations
and corrections.

The size of the vocabulary differs considerably among
the tasks and languages. The smallest vocabulary size
is observed for the train/flight information domain,
with around a thousand words per language. For the
other domains, land register data entry and date
scheduling, the vocabulary is larger and varies between
2000 and 7000 words depending on the language. For the
experiments without language modeling we tried to limit
the recognition vocabulary to a smaller and equal size
for all languages, but kept the original size of the
lexicon for the experiments with language models.

4.  PHONE  SUBSTITUTIONS 

Each  language  has  its  own  characteristic  set  of  pho- 
netic units,  and  from  the  phones,  different  phoneme 
systems  may  be  built.  For  example,  in  Japanese,  no 
distinction  is  made  between  /r/  and  /l/  and  they 
would  thus  belong  to  the  same  phoneme  class  in 
that  language,  whereas  in  other  languages  they  are 
phonemes  classes  on  their  own  since  a semantic  dif- 
ference occurs  such  that  words  get  a new  meaning 
when  e.  g.  /r/  is  replaced  by  /l/.  Some  sounds  are 
also  unique  to  some  languages,  for  example  the  vowel 
/y/  appears  within  these  languages  only  in  German. 
If  recognition  is  performed  for  German  with  a rec- 
ognizer that  was  trained  with  other  languages,  the 
sound  /y/  must  be  modeled  although  it  was  not  repre- 
sented in  the  training  material.  Thus,  the  parameters 
of  /y/  must  be  estimated  from  other  vowels  like  /I/. 
Sometimes  there  is  the  same  symbol  used  for  sounds 
of  different  languages,  but  the  acoustic  properties  dif- 
fer for  these  sounds.  When  recognizing  multiple  lan- 
guages simultaneously,  it  may  thus  be  reasonable  to 
share  some  sounds  across  languages  and  to  stay  with 
monolingual  units  for  other  sounds. 

Thus,  for  both  approaches  of  cross-lingual  and  simul- 
taneous multilingual  recognition,  relations  and  simi- 
larities among  sounds  of  different  languages  must  be 
found. 

In general, we can distinguish between a 1:1 mapping
of phones between languages and an n:1 or 1:m mapping
of phones, which would mean that, for example, the
parameters of /y/ are estimated as the mean values of
/I/ and /u/. In this work we consider the first
strategy, the 1:1 mapping. In a rough classification,
we distinguish among three different approaches
within the 1:1 mapping.

na(t)ive  approach:  this  approach  follows  the  prin- 
ciple a non-native  would  follow  when  speaking  a 
second  language:  he  basically  has  the  phonetic 
inventory  of  the  first  language  and  partially  uses 
that  inventory  when  speaking  the  second  lan- 
guage. Some  of  the  new  phones  can  be  learnt 
by  a language  learner,  but  they  are  not  always 
pronounced  correctly,  and  under  stress  condition 
or  within  difficult  words  a non-native  may  fall 
back  to  his  native  phonetic  inventory.  For  exam- 
ple Japanese  speaking  English  or  German  often 
confuse  the  use  of  /r/  and  /l/. 

phonetic approach: this strategy follows the principles
of sound production in the human vocal tract. The
characteristics of sound production can be classified
into place and manner of production: the first
describes where obstacles are put in the air flow and
which organs are involved in the production of sounds,
and the second describes the manner in which the
obstacles act, e. g. a complete or partial closure
of the air flow.

Thus, consonants can be distinguished with regard to
manner among stop, fricative, approximant, lateral,
rhotic and others, and with regard to place among
labial, dental, alveolar, palatal, velar and others.
Another criterion is the voicing of consonants, which
can be either voiced or unvoiced. For vowels, different
tongue positions are distinguished, such as front,
central and back; the opening of the mouth is
distinguished among close, close-mid, open-mid and
open; and the shape of the lips between rounded and
unrounded.

The  difference  between  consonants  is  clearer 
than  between  vowels,  e.  g.  a plosive  has  a com- 
plete closure,  while  others  do  not  have  a complete 
closure,  and  there  is  no  sound  between  e.  g.  a 
plosive  and  a fricative.  For  vowels,  the  position 
of  the  tongue  can  gradually  change  and  there  are 
transitions  between  a front  and  a central  vowel, 
so  the  distinction  and  classification  of  vowels  can 
be  more  difficult. 

For  the  substitution  of  sounds  in  this  approach, 
that  sound  that  agrees  in  the  most  phonetic  fea- 
tures with  the  untrained  one  is  taken  instead  of 
the  unknown  one  of  the  goal  language.  For  ex- 
ample, /p/  (plosive,  labial,  unvoiced)  may  be  re- 
placed by  /b/  (plosive,  labial,  voiced)  or  by  /t/ 
(plosive,  dental,  unvoiced).  Some  hierarchy  has 
to  be  built  in  order  to  define  which  of  the  criteria 
will  be  changed  first. 

data-driven  approach:  this  approach  determines 
the  similarity  among  phones  with  the  data  given 
by  the  trained  recognizer.  This  approach  is  only 
possible  if  there  is  training  data  available  for  the 
new  language,  i.  e.  some  adaptation  data  or  for 
the  case  of  simultaneous  multilingual  recognition 
for  the  decision  if  acoustic  units  should  be  joined. 
Measures  for  the  similarity  can  e.  g.  be  estimated 
from  the  Gaussian  densities  or  the  codebook  pa- 
rameters of  a trained  recognizer.  Therefore  a 
recognizer  must  be  trained  with  all  languages, 
and  for  all  observations  of  a language-dependent 
sound  the  similarity  parameters  like  mean  val- 
ues must  be  estimated  and  then  according  to  a 
distance  measure  the  most  similar  units  may  be 
joined.  This  merging  of  units  can  happen  in  one 
or  more  steps  and  it  may  also  be  allowed  to  split 
units. The advantage of this approach is that no human
knowledge or manual work is necessary to estimate
similarities; the disadvantage may lie in the inexact
determination of the segmentation of the speech signal
into sounds and consequently an error-prone measure of
similarity among sounds.
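
As an illustration of the phonetic approach described above, the
following minimal Python sketch selects the replacement that agrees
with the untrained phone in the most heavily weighted features. The
feature table, the weights and the function names are assumptions made
for this illustration only; they are not taken from the ISADORA system.

```python
from typing import Dict, Iterable, Tuple

# (manner, place, voicing). The entries for /p/, /b/ and /t/ follow the
# example in the text; /d/ and /s/ are added for illustration.
Features = Tuple[str, str, str]
FEATURES: Dict[str, Features] = {
    "p": ("plosive", "labial", "unvoiced"),
    "b": ("plosive", "labial", "voiced"),
    "t": ("plosive", "dental", "unvoiced"),
    "d": ("plosive", "dental", "voiced"),
    "s": ("fricative", "alveolar", "unvoiced"),
}


def mismatch_cost(a: Features, b: Features,
                  weights: Tuple[int, int, int]) -> int:
    """Sum the weights of all features in which the two phones differ."""
    return sum(w for w, x, y in zip(weights, a, b) if x != y)


def substitute(phone: str, trained: Iterable[str],
               weights: Tuple[int, int, int] = (4, 2, 1)) -> str:
    """Return the trained phone that agrees best with `phone`.

    `weights` encodes an assumed hierarchy for manner, place and
    voicing: the feature with the smallest weight is given up first.
    Ties are broken alphabetically to keep the result deterministic.
    """
    target = FEATURES[phone]
    return min(trained,
               key=lambda p: (mismatch_cost(target, FEATURES[p], weights), p))


# Voicing is given up first with the default weights.
print(substitute("p", ["b", "t", "s"]))
# Place is given up first when its weight is the smallest.
print(substitute("p", ["b", "t", "s"], weights=(4, 1, 2)))
```

Changing the weights corresponds to changing the hierarchy of criteria:
with the default weights /p/ is replaced by /b/, whereas a hierarchy
that gives up place before voicing replaces it by /t/.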

The phonetic description of consonants separates better
into classes, while measures for the classification of
vowels correlate with formant frequencies. For these
formant frequencies, every intermediate value between
two vowels of, say, 500 and 600 Hertz is possible, and
thus genuinely different sounds may occur. On the other
hand, this characteristic may make it easier to
calculate the parameters of sounds by mixing sounds
which would average to the same formant frequency.
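
The idea of mixing sounds, i. e. an n:1 mapping such as estimating
/y/ from /I/ and /u/, can be sketched as a weighted average of
parameter vectors. The sketch below is only an illustration: the
two-dimensional vectors stand in for formant-like mean values and are
made-up numbers, not measurements from our data, and `mix_parameters`
is a hypothetical helper.

```python
from typing import Dict, List, Optional, Sequence


def mix_parameters(means: Dict[str, List[float]],
                   sources: Sequence[str],
                   weights: Optional[Sequence[float]] = None) -> List[float]:
    """Estimate a parameter vector as a weighted mean of source phones.

    With `weights=None` every source phone contributes equally.
    """
    if not sources:
        raise ValueError("at least one source phone is required")
    if weights is None:
        weights = [1.0 / len(sources)] * len(sources)
    if len(weights) != len(sources):
        raise ValueError("one weight per source phone is required")
    mixed = [0.0] * len(means[sources[0]])
    for phone, weight in zip(sources, weights):
        for i, value in enumerate(means[phone]):
            mixed[i] += weight * value
    return mixed


# Illustrative (invented) first and second formant values in Hertz.
example_means = {"I": [400.0, 2000.0], "u": [300.0, 800.0]}
print(mix_parameters(example_means, ["I", "u"]))
```

A 1:1 mapping is the special case with a single source phone and
weight 1.
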
Another decision is the type of acoustic units that
will be used for the target recognizer, especially
whether the units ought to be mono- or multilingual.
For example, given n available languages each
containing the sound /a/, it must be decided whether
the sound /a/ for the target language (without training
material of its own) shall result from the /a/ of one
language or from a mixture of a certain number of
/a/'s. With substitution approaches 1 and 2, the
multilingual units may be trained together, and with
approach 3 it may be determined from the data whether
all or only some of the /a/'s shall influence the
modeling of the new /a/.

Comparing  the  results  of  these  different  strategies  for 
phone  substitution  it  can  be  found  that  approaches 
1 and  2 are  quite  similar,  of  course  depending  on 
the  priorities  set  for  substitution  to  manner  or  place 
in approach 2. Differences occur mostly when the
orthography suggests the pronunciation of a different
native sound than the similarity according to acoustic
features would suggest. For example, in the na(t)ive
approach, /u/ may be replaced by /U/ because of the
shared orthographic spelling [u], rather than by the
possibly phonetically closer /o/ if the corresponding
criterion is chosen.

Approach  3 is  only  possible  if  a certain  amount  of 
data  is  available  for  all  languages;  in  general  it  is  used 
for  the  design  of  multilingual  acoustic  units.  Errors  in 
this  approach  can  occur  if  there  is  not  sufficient  data 
available  for  each  language  and  thus  the  parameters 
have not been well estimated. Another source of error
for the third approach arises when the labeling of the
speech material according to acoustic units is not
completely correct, e. g. with automatic segmentation.
Sometimes silence is assigned to a certain sound and
thereby changes the statistical properties of this
sound.
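
A minimal sketch of the data-driven decision is given below. It
assumes that each language-dependent unit is summarized by a single
mean vector and that similarity is measured by the Euclidean distance;
both are simplifying assumptions made for this illustration, and the
unit names and numbers are invented.

```python
import math
from typing import Dict, List


def nearest_units(target_means: Dict[str, List[float]],
                  trained_means: Dict[str, List[float]]) -> Dict[str, str]:
    """Map every target unit to the trained unit with the closest mean.

    Ties are broken alphabetically so that the result is deterministic.
    """
    mapping = {}
    for unit, mean in target_means.items():
        mapping[unit] = min(
            trained_means,
            key=lambda u: (math.dist(mean, trained_means[u]), u),
        )
    return mapping


# Invented mean vectors for one target unit and two trained units.
target = {"de:y": [1.0, 2.0]}
trained = {"it:i": [1.0, 3.0], "it:u": [4.0, 2.0]}
print(nearest_units(target, trained))
```

Poorly estimated means, e. g. from too little data or from silence
assigned to a sound, shift these distances directly, which is why the
error sources named above matter for this approach.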

Another source of errors may be different recording
conditions. A consequence may be that sounds of the
same language, irrespective of their phonetic features,
are estimated as more similar to each other than to any
sound of another language. In our experiments, this
happened for Slovenian sounds, which were in many cases
more similar to each other than to any sound of another
language.

One special phenomenon that has arisen in the
data-driven decision is the similarity of /j/ and /z/,
which have quite different phonetic characteristics
(approximant-palatal-voiced vs.
fricative-alveolar-voiced). This similarity has also
been shown in several other approaches [5, 6]; thus
there may be other measures of importance besides the
phonetic features determined so far.

5.  EXPERIMENTS 

For our recognition experiments we used the
ISADORA recognizer [9] with semi-continuous Hidden
Markov Models. We performed experiments both with and
without language models; for the experiments without
language models we used a reduced recognition
vocabulary in order to limit the perplexity of the
task.

Instead of polyphones, i. e. context-dependent
acoustic units, we used only monophones, i. e. the
phone itself without any context. Performance
decreases with context-free acoustic units, but only
with these units can we keep the number of acoustic
units and, even more importantly, the number of
necessary substitutions at a relatively low level.

As baseline systems, we ported our recognition system
to the new languages and use the performance obtained
with the monolingual recognizers as the reference for
our cross-lingual experiments.

Concerning acoustic units, we considered sounds
represented by the same phonetic symbol as identical,
and thus, for our cross-lingual experiments, we have
to replace those phones whose symbol does not occur in
the target language. Furthermore, we did not count
replacements that concern only the length of phones:
if only a long vowel like /i:/ existed and the short
counterpart /i/ was needed, we did not count this as a
substitution. The same was done for Italian geminates;
thus /nn/ was set equal to /n/ and the substitution
was not counted.
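
The counting rule described above can be sketched as a set difference
over normalized phone symbols. The notation is an assumption of this
sketch: length is written with a trailing ":" and a geminate as a
doubled symbol, and the example inventories are invented rather than
taken from Table 2.

```python
from typing import Iterable, List


def normalize(phone: str) -> str:
    """Ignore length marks and geminates: 'i:' -> 'i', 'nn' -> 'n'."""
    base = phone.rstrip(":")
    if len(base) == 2 and base[0] == base[1]:
        base = base[0]
    return base


def phones_to_substitute(needed: Iterable[str],
                         modeled: Iterable[str]) -> List[str]:
    """List the needed phones that the recognizer does not model.

    Symbols are compared after normalization, so differences in length
    or gemination alone are not counted as substitutions.
    """
    known = {normalize(p) for p in modeled}
    return sorted({normalize(p) for p in needed} - known)


# Invented inventories: only /y/ has no counterpart in the recognizer.
needed_phones = ["i", "nn", "y", "a"]
recognizer_phones = ["i:", "n", "a", "u"]
print(phones_to_substitute(needed_phones, recognizer_phones))
```

The length of the returned list corresponds to one entry of a
substitution table; the union of several recognizer inventories can be
passed as `modeled` to obtain the count for a multilingual recognizer.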

In Table 2 the number of substitutions across
languages is shown. There are no substitutions between
G1 and Italian since they share proper names of both
languages and thus phones of both languages are
modeled in each recognizer. Between G1 and G2 there
are two substitutions for originally Italian phones
(/J/, /L/) which are used in the G1 recognizer. There
is a high number of substitutions between the Germanic
languages (English, German) on the one side and the
Slavic languages (Slovak, Slovenian, Czech) on the
other side, due to the high number of consonants
modeled in the Slavic languages and the high number of
vowels in the Germanic languages.

Furthermore, we can observe that, when using the
Japanese recognizer for the recognition of any of the
other languages, a high number of substitutions has
to be made, since the phone inventory of the Japanese
language is small in comparison to those of the other
languages. On the other hand, for recognition of
Japanese with any other recognizer, only a small
number of substitutions has to be performed.

Furthermore, we have also listed in that table the
number of substitutions for multilingual recognizers.
As expected, the number of substitutions decreases
with respect to the corresponding monolingual
recognizers, although three languages cannot cover the
complete phone inventory of all the others. We have
found that, leaving Japanese aside, the phone inventory
of the remaining 6 languages can be covered without
substitution only when all 6 languages are involved in
training; thus no real multilingual inventory is
possible with a subset of these languages.

We performed experiments with na(t)ive and phonetic
substitution as well as some preliminary experiments
with data-driven substitution for the cross-lingual
experiments.

Table 2. Substitution of phones with different languages
and recognizers

6.  RESULTS 

The experiments performed for this contribution were
done without optimization, i. e. without using
polyphones as acoustic units, without using a polygram
verification for language modeling, and without
optimizing the training procedure, in order to obtain
recognizers trained at the same level. Thus, the
results given here do not correspond to optimally
trained recognizers, but are comparable to each other
with respect to modeling and training.

Results of the experiments with language modeling
are given in Table 3 for monolingual and cross-lingual
recognition, where the monolingual results are shown
on the diagonal. We also give some results for
multilingually trained recognizers in the second part
of that table.

Using different strategies for phone substitution did
not lead to significant differences between the na(t)ive
and the phonetic approach, but the na(t)ive approach
often seems slightly better than the replacement
strategy proposed by [5]. With data-driven
substitution, we found substitutions that correspond
roughly to phonetic similarities for the Italian and G1
data, but for other languages the similarities do not
correspond to phonetic properties. For Slovenian, for
example, the phones classified as most similar were in
most cases also Slovenian phones; probably the
recording conditions dominated over the phonetic
similarities.

For all languages besides German G2, recognition is
best for the monolingual recognizer trained with data
of that language and domain. For G2, recognition
proved to be better with the bilingual German-English
recognizer under these conditions.

The performance among the languages ranges from
37 % for G2 to 94 % for Italian. There are various
reasons for this difference. The domains differ in
difficulty: the best recognition is achieved in the
SpeeData task, followed by SQEL and finally the
VerbMobil task. There are different types of speech
and different recording conditions, with hesitations,
background noise etc. Furthermore, the size of the
vocabulary is different for each language. Finally,
the languages themselves differ in their difficulty
for recognition; some languages may be easier to
recognize than others due to phonetic structure, word
length and other reasons.

Table 3. Recognition results for cross-lingual experiments

In order to compare the performance of the
cross-lingual recognizers trained with one language, we
averaged the performance of all recognizers besides the
one of the original language and domain. The best
cross-lingual recognition averaged over the seven other
recognizers was achieved for Italian with 78.73 %, the
worst for G2 with 11.37 %. The ranking in the
recognition rate remains the same as in the monolingual
recognition experiments; only Czech moves up one step,
which could be interpreted to mean that Czech is easier
to recognize than Slovenian, which moved one step down.

Furthermore, we calculated the relative performance
by dividing the cross-lingual performance by the
monolingual performance and obtained the same ranking.
Here, Italian retains 85.56 % of the monolingual
recognition rate, thus the loss of performance when
recognizing with other languages is below 15 % on
average, while G2 with 30.65 % retains only one third
of the performance.
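
The two figures used above, the averaged cross-lingual performance and
its ratio to the monolingual performance, can be sketched as follows.
The table layout `results[recognizer][spoken_language]` and the
numbers for the placeholder languages A, B and C are assumptions of
this sketch and do not reproduce Table 3.

```python
from typing import Dict

Results = Dict[str, Dict[str, float]]


def cross_lingual_average(results: Results, language: str) -> float:
    """Average the scores of all recognizers except the language's own."""
    scores = [row[language] for rec, row in results.items() if rec != language]
    return sum(scores) / len(scores)


def retained_ratio(results: Results, language: str) -> float:
    """Divide the cross-lingual average by the monolingual score."""
    return cross_lingual_average(results, language) / results[language][language]


# Invented recognition rates in percent; the diagonal is monolingual.
example_results: Results = {
    "A": {"A": 80.0, "B": 40.0, "C": 30.0},
    "B": {"A": 60.0, "B": 50.0, "C": 20.0},
    "C": {"A": 40.0, "B": 30.0, "C": 40.0},
}
print(cross_lingual_average(example_results, "A"))
print(retained_ratio(example_results, "A"))
```

With the figures reported above for G2 (11.37 % on average, about
37 % monolingually), the same division gives roughly 0.31.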

Both calculations are difficult to interpret since
the similarity of languages, and thus the
recognizability, cannot be taken into account; for
example, we have two German recognizers in the
cross-lingual experiments. Assuming a higher similarity
among the Slavic languages, the cross-lingual
performance should be higher when recognizing the
Slavic languages with Slavic recognizers than with the
others. Furthermore, the cross-lingual recognition of
Japanese could be worse because no languages similar to
Japanese are used for recognition.

From these numbers we can observe that, when the
monolingual recognition rate is poor to begin with, the
performance in cross-lingual experiments suffers more
than for languages and domains where the performance
is already higher.

Averaging the performance of cross-lingual recognizers
over the different spoken languages, we find that,
among the monolingually trained recognizers, the best
cross-lingual performance was achieved by the Slovenian
recognizer, which led three times to the best
cross-lingual recognition, whereas Czech, English and
Japanese never performed best; thus the Slovenian
recognizer seems to be best for cross-lingual
recognition in this task. If two languages are similar,
their reciprocal cross-lingual performance should rank
high compared to other languages. Only Slovak and
Slovenian mutually showed the best cross-lingual
performance and may therefore be assumed similar for
this speech recognition task, although theoretically
Slovak and Czech should be more similar than those two
languages.

For other languages, no such symmetry is observable;
even the two German recognizers do not lead to the
highest reciprocal results: G1 recognizes G2 best, but
not vice versa. This may be due to different speaking
styles, but more probably to the different speakers,
since the speakers of G1 speak with a dialect and with
a non-native accent, while the G2 speakers are German
natives and do not speak with a strong dialect.
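
The notion of reciprocal best performance used above can be made
concrete with a short sketch: a pair of languages is symmetric if each
one's recognizer is the best cross-lingual recognizer for the other.
The table layout `results[recognizer][spoken_language]` and the
numbers for the placeholder languages A, B and C are invented for this
illustration and are not results from Table 3.

```python
from typing import Dict, List, Tuple

Results = Dict[str, Dict[str, float]]


def best_recognizer(results: Results, language: str) -> str:
    """Return the best recognizer for `language`, excluding its own."""
    scores = {rec: row[language] for rec, row in results.items()
              if rec != language}
    # Ties are broken by name to keep the result deterministic.
    return max(scores, key=lambda rec: (scores[rec], rec))


def mutual_best_pairs(results: Results) -> List[Tuple[str, str]]:
    """List the language pairs that recognize each other best."""
    pairs = set()
    for language in results:
        partner = best_recognizer(results, language)
        if best_recognizer(results, partner) == language:
            pairs.add(tuple(sorted((language, partner))))
    return sorted(pairs)


# Invented recognition rates in percent; the diagonal is monolingual.
symmetry_example: Results = {
    "A": {"A": 70.0, "B": 55.0, "C": 40.0},
    "B": {"A": 60.0, "B": 65.0, "C": 45.0},
    "C": {"A": 50.0, "B": 35.0, "C": 60.0},
}
print(mutual_best_pairs(symmetry_example))
```

In this invented table A and B form a symmetric pair, while C is
recognized best by B without the reverse being true, which mirrors the
asymmetric case described for the two German recognizers.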

With multilingual recognizers, trained with several
languages, performance is worse than with the
appropriate monolingual recognizer. When the target
language is not included in training, the performance
is better than with cross-lingual monolingual
recognizers. Unfortunately, no multilingual recognizers
were trained for those languages which have the highest
cross-lingual performance; thus the best monolingual
cross-lingual recognizers often perform better than the
best multilingual recognizers trained in these
experiments.

Of the available multilingual recognizers, the
G2-English-Japanese recognizer performs best for these
data, possibly due to the larger variety in the models
provided by Japanese in addition to the models of the
Germanic languages.

7.  CONCLUSION 

In this contribution, we compared the performance
of different monolingual recognizers with respect to
cross-lingual recognition. In our experiments with
non-optimized recognizers (only monophones, no polygram
verification in the language models, no optimization in
the training), we found that, except for the German G2
task, performance is best for monolingual recognizers.
The performance for the different languages differs due
to the different difficulty of the tasks and also due
to the differing recognizability of the languages.

When  monolingual  recognition  is  already  bad,  cross- 
lingual  performance  gets  even  worse.  Thus,  for  Ital- 
ian, the  average  decrease  in  performance  is  15  %, 
whereas  for  G2  only  one  third  is  recognized  with  re- 
spect to  the  monolingual  recognizer.  Cross-lingual 
performance  does  not  show  strong  symmetry  in  the 
recognition,  only  Slovak  and  Slovenian  recognize  ut- 
terances of  the  other  language  better  than  any  other 
language. 

When  recognizing  with  multilingual  cross-lingual  rec- 
ognizers, performance  gets  better  than  with  the  corre- 
sponding monolingual  recognizers.  Unfortunately,  we 
have  not  trained  all  combinations  of  recognizers,  so 
the  combination  of  the  best  monolingual  cross-lingual 
recognizers  could  not  always  be  tested. 

Concluding, we found for these languages and domains
that the best performance is obtained with monolingual
recognizers. For cross-lingual recognition, the choice
of the language for training the recognizer is
important for the performance. Furthermore, we found
that performance increases if training data of more
languages are involved, so that acoustic units are
modeled with more variety and more training material,
and more distinct acoustic units are modeled overall.